2017-06-24

Web Scraping Notes 2

Handling Redirects

A server-side redirect, depending on how it is handled, can usually be followed by Python's urllib library without any help from Selenium.
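To see the server-side case concretely, here is a minimal sketch. It assumes a toy local server (the RedirectHandler class, the /old and /new paths, and the fetch_following_redirects helper are all made up for illustration) and relies on urlopen following 3xx responses by default.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy handler (hypothetical): /old answers with a 301 to /new."""

    def do_GET(self):
        if self.path == "/old":
            # Server-side redirect: a 3xx status plus a Location header.
            self.send_response(301)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_following_redirects(url):
    """Return (final_url, body); urlopen follows the redirect on its own."""
    with urlopen(url) as response:
        return response.geturl(), response.read()


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

base = "http://127.0.0.1:%d" % server.server_port
final_url, body = fetch_following_redirects(base + "/old")

server.shutdown()
server.server_close()
```

After the call, final_url points at /new rather than /old, which is a simple way to check whether a redirect happened.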

Client-side redirects won't be handled at all unless something actually executes the JavaScript.

Selenium can handle these JavaScript redirects the same way it handles any other JavaScript execution. The main issue is when to stop page execution, that is, how to tell when a page is done redirecting.

A clever way to detect the redirect is to grab an element in the DOM when the page initially loads, then repeatedly call that element until Selenium throws a StaleElementReferenceException. At that point the element is no longer attached to the page's DOM, which means the site has redirected.

Image Processing and Text Recognition

Pillow

Pillow allows you to easily import and manipulate images with a variety of filters, masks, and even pixel-specific transformations.

from PIL import Image, ImageFilter

# Open the image, apply a Gaussian blur, then save and display the result
kitten = Image.open("kitten.jpg")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("kitten_blurred.jpg")
blurryKitten.show()
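As a small illustration of a pixel-level transformation, here is a sketch that thresholds an image to pure black and white using Image.point. The threshold helper and the cutoff value of 128 are assumptions chosen for the example, and the tiny sample image is built in memory so no file is needed.

```python
from PIL import Image


def threshold(image, cutoff=128):
    """Return a grayscale copy where pixels >= cutoff become white, others black."""
    gray = image.convert("L")
    # point() applies the function to every possible pixel value (0-255).
    return gray.point(lambda value: 255 if value >= cutoff else 0)


# Build a tiny 2x1 grayscale image in memory: one dark pixel, one light pixel.
sample = Image.new("L", (2, 1))
sample.putpixel((0, 0), 40)
sample.putpixel((1, 0), 200)

result = threshold(sample)
```

This kind of cleanup can be handy before text recognition, since it strips out faint background noise.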

For more, see the Pillow documentation: http://pillow.readthedocs.org/

Tesseract

Tesseract is an OCR library; it can be used to scrape text from images on websites.

Bowen He's Blog
